NVIDIA Performance Tools Explorer

An interactive guide to understanding and utilizing NVIDIA's suite of performance analysis tools for accelerated workloads.

Advanced Application Profiling Techniques

This section focuses on specific Nsight Systems techniques for in-depth analysis of application performance, covering GPU hardware-level metrics and Python application behavior to uncover complex bottlenecks.

GPU Hardware Profiling with Nsight Systems

What it is: Utilizing Nsight Systems to collect detailed GPU hardware performance counters and metrics. This goes beyond API call tracing to understand how the application is actually utilizing the GPU's architectural components.

Key Use Cases:

  • Identifying if an application is truly GPU-bound by examining SM (Streaming Multiprocessor) occupancy and activity, Tensor Core usage, and other execution unit statuses.
  • Understanding GPU memory bandwidth utilization (device memory, L1/L2 cache).
  • Diagnosing GPU stalls related to memory latency, instruction fetching, or execution dependencies within kernels.
  • Correlating periods of high/low GPU hardware activity with specific CUDA kernels or graphics API calls on the timeline.

Core Benefit: Provides a deeper understanding of application-GPU hardware interaction, revealing opportunities for optimization by modifying kernel code, launch parameters, or data access patterns. It helps answer "why" a GPU is performing a certain way.

Note: Collecting detailed GPU metrics can introduce higher profiling overhead compared to API tracing alone. Choose metrics and frequency wisely.

Once Nsight Systems GPU metrics have identified problematic kernels or GPU activity patterns, NVIDIA Nsight Compute is typically used for an even more granular, microarchitectural analysis of those individual CUDA kernels.

Example Command(s) for Nsight Systems GPU Metrics:

# Profile and collect GPU metrics from all available/compatible GPUs
nsys profile --gpu-metrics-device=all ./my_application

# Collect GPU metrics from a specific GPU (e.g., GPU 0) along with CUDA and NVTX trace
nsys profile -t cuda,nvtx --gpu-metrics-device=0 -o gpu_metrics_report ./my_application

# Raise the GPU metrics sampling rate (in Hz) for more detail at the cost of more overhead
# (see `nsys profile --help` for the frequency range supported by your version)
nsys profile --gpu-metrics-device=all --gpu-metrics-frequency=20000 ./my_application

# The available GPU metric sets depend on the GPU architecture; consult `nsys profile --help`
# or the Nsight Systems documentation, then select one explicitly:
# nsys profile --gpu-metrics-device=all --gpu-metrics-set="<your_set_name>" ./my_application

Python Application Profiling with Nsight Systems

What it is: Using Nsight Systems to profile Python applications, enabling visualization of Python function calls on the timeline and their correlation with underlying system activity, including CPU usage, OS calls, and GPU work if NVIDIA libraries are used.

Key Use Cases:

  • Identifying performance bottlenecks within Python code sections.
  • Understanding the overhead of the Python interpreter versus native C/C++ or GPU execution when using libraries like PyTorch, TensorFlow, CuPy, or Numba.
  • Visualizing the interaction between Python scripts and the CUDA kernels or GPU libraries they invoke.
  • Analyzing CPU core utilization and OS interactions (e.g., I/O, threading) stemming from Python code.
  • Sampling Python call stacks to pinpoint hot functions.

Core Benefit: Helps optimize Python applications that leverage GPUs by providing insights into both Python-level execution and its impact on GPU-level performance, bridging the gap between high-level scripting and hardware execution.

Example Command(s) for Nsight Systems Python Profiling:

# Basic profiling of a Python script
nsys profile python my_script.py arg1 arg2

# Profile Python with CUDA, NVTX, OS runtime tracing, and Python call stack sampling enabled
# (-w true forwards the target process's stdout/stderr to the console, which is handy for some Python setups)
nsys profile -w true -t cuda,nvtx,osrt --python-sampling=true -o python_report python my_script.py

# If using a specific Python interpreter or virtual environment
nsys profile /path/to/my/venv/bin/python my_script.py

# Increase the Python call stack sampling frequency (default is 1000 Hz; higher values give more detail)
nsys profile --python-sampling=true --python-sampling-frequency=2000 python my_script.py

# Trace specific Python regions using NVTX annotations in your Python code
# (requires adding NVTX ranges, e.g. via the nvtx package, torch.cuda.nvtx, or cupy.cuda.nvtx; see the sketch below)
nsys profile -t nvtx,cuda python my_nvtx_annotated_script.py
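
Illustrative Python NVTX annotation (a minimal sketch; it assumes the nvtx pip package, while torch.cuda.nvtx or cupy.cuda.nvtx expose equivalent push/pop calls):

import time
import nvtx  # pip install nvtx

@nvtx.annotate("prepare_data", color="blue")
def prepare_data():
    time.sleep(0.1)  # stand-in for real preprocessing work

def main():
    prepare_data()
    for step in range(3):
        # Each named range appears as an NVTX row on the Nsight Systems timeline
        with nvtx.annotate(f"step_{step}", color="green"):
            time.sleep(0.05)  # stand-in for a compute step

if __name__ == "__main__":
    main()

# Profile with: nsys profile -t nvtx,cuda python my_nvtx_annotated_script.py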

Core Profiling & Analysis Tools

This section introduces NVIDIA's primary tools for detailed analysis of GPU and overall system performance. These tools help you understand application behavior at both a high level and a granular kernel level.

NVIDIA Nsight Systems: System-Wide Bottleneck Identification

What it is: A system-wide performance analysis tool that visualizes an application's algorithms and the interactions between CPUs, GPUs, the OS, and APIs on a unified timeline.

Key Use Cases:

  • Identifying if an application is CPU-bound, I/O-bound, or lacks parallelism.
  • Understanding CPU-GPU interactions and data transfers.
  • Initial assessment of network communication and multi-node performance.
  • Correlating activities across CPUs, GPUs, NICs, and OS.

Core Benefit: Serves as a "first-look" diagnostic to pinpoint the largest optimization opportunities by exposing system-level bottlenecks with low overhead.

Nsight Systems guides developers toward the true sources of inefficiency, preventing premature optimization of the wrong components.

Example Command(s):

# Basic profiling of an application
nsys profile ./my_application arg1 arg2

# Profile with specific traces (CUDA, NVTX) and statistics enabled
nsys profile -t cuda,nvtx --stats=true -o my_report ./my_application

NVIDIA Nsight Compute: Deep Dive into CUDA Kernels

What it is: An interactive kernel profiler for detailed analysis of CUDA and NVIDIA OptiX kernels.

Key Use Cases:

  • Optimizing specific CUDA kernels by examining GPU throughput, warp state, instruction execution, and memory access patterns.
  • API debugging and correlating performance data to source code (CUDA C/C++, PTX, SASS).
  • Receiving guided analysis and actionable recommendations based on NVIDIA best practices.

Core Benefit: Provides indispensable granular detail to maximize individual GPU kernel efficiency, crucial after Nsight Systems identifies a GPU-bound workload.

Nsight Compute helps resolve performance limiters like instruction latency, memory bandwidth constraints, or low SM occupancy.

Example Command(s):

# Profile all kernels in an application and save to a report file
ncu -o my_kernel_report ./my_application

# Profile only kernels whose names match a regular expression
ncu --kernel-name "regex:my_kernel_name" ./my_application

# Collect a specific section (e.g., MemoryWorkloadAnalysis) instead of the default section set
ncu --section MemoryWorkloadAnalysis ./my_application

Cluster Management & Monitoring

For large-scale deployments, managing and monitoring the health and performance of GPUs across the cluster is vital. This section focuses on tools designed for this purpose.

NVIDIA Data Center GPU Manager (DCGM)

What it is: A suite for managing and monitoring NVIDIA GPUs in large-scale data center and cluster environments.

Key Use Cases:

  • Continuous, cluster-wide operational health monitoring and diagnostics.
  • System alerts and governance policies.
  • Gathering rich GPU telemetry (utilization, memory, power, clocks, temperature, PCIe/NVLink errors).
  • Integration with cluster management tools like Kubernetes (via DCGM-Exporter), Prometheus, and Grafana.

Core Benefit: Ensures GPUs are functioning correctly and helps identify systemic issues or resource underutilization across the data center, improving reliability and uptime.

DCGM provides the continuous oversight needed to ensure the underlying hardware infrastructure is sound, complementing application-specific profiling.

Example Command(s):

# List all discoverable GPUs on the system
dcgmi discovery -l

# Check health status for the GPUs in group 1 (list existing groups with: dcgmi group -l)
dcgmi health -g 1 -c

# Run short diagnostics (level 1) on all GPUs in group 1
dcgmi diag -g 1 -r 1

# Continuously monitor specific GPU metrics (e.g., power, temperature, utilization) for GPU 0
# Field IDs: 155 (power usage), 150 (GPU temperature), 203 (GPU utilization), 204 (memory copy utilization)
dcgmi dmon -i 0 -e 155,150,203,204

GPU Memory Testing

Testing GPU memory is crucial for ensuring hardware stability, data integrity, and identifying potential faults. Various tools can help diagnose memory issues, from hardware errors to access problems in CUDA code.

DCGM Diagnostics (Memtest)

Focus: Hardware fault detection and stability, particularly for server/data center GPUs.

Key Use Cases: Runs various tests on GPU memory to identify problematic memory modules or hardware faults.

Core Benefit: Helps ensure memory hardware integrity. Higher diagnostic levels provide more thorough testing.

Example Command(s):

# Run Level 2 diagnostics (includes memory tests) on all GPUs in group 1
sudo dcgmi diag -g 1 -r 2

# Run Level 3 diagnostics (more intensive memory tests)
sudo dcgmi diag -g 1 -r 3

nvidia-smi (for ECC Memory)

Focus: Reporting ECC (Error Correcting Code) memory errors for GPUs that support this feature.

Key Use Cases: Checking for single-bit (corrected) and double-bit (uncorrected) ECC errors. Monitoring retired memory pages due to persistent errors.

Core Benefit: Provides visibility into the health and error status of ECC-enabled GPU memory.

Example Command(s):

# Query memory details, ECC status, and page retirement info
nvidia-smi -q -d MEMORY,ECC,PAGE_RETIREMENT

# Continuously monitor ECC errors (refreshes every second)
watch -n 1 "nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total --format=csv,noheader"
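
For scripted monitoring, the sketch below reads the same volatile ECC counters through the pynvml bindings (nvidia-ml-py). This is an illustrative alternative and assumes an ECC-capable GPU with ECC enabled:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

# Volatile counters reset on reboot/driver reload; use NVML_AGGREGATE_ECC for lifetime totals
corrected = pynvml.nvmlDeviceGetTotalEccErrors(
    handle, pynvml.NVML_MEMORY_ERROR_TYPE_CORRECTED, pynvml.NVML_VOLATILE_ECC)
uncorrected = pynvml.nvmlDeviceGetTotalEccErrors(
    handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC)
print("corrected:", corrected, "uncorrected:", uncorrected)

pynvml.nvmlShutdown()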

NVIDIA Compute Sanitizer (memcheck)

Focus: Debugging memory access errors within CUDA applications at the code level.

Key Use Cases: Detecting out-of-bounds global/shared memory accesses, misaligned accesses, uninitialized memory usage, and other memory-related bugs in CUDA kernels.

Core Benefit: Helps CUDA developers write more robust and correct code by pinpointing memory safety issues.

Example Command(s):

# Run memcheck tool on a CUDA application
compute-sanitizer --tool memcheck ./my_cuda_application

# Generate a report file
compute-sanitizer --tool memcheck --log-file memcheck_report.log ./my_cuda_application
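
To illustrate the class of bug memcheck reports, here is a minimal sketch of an out-of-bounds write. Numba is assumed here purely as a convenient way to launch a CUDA kernel from Python; the same applies to any CUDA C/C++ kernel:

import numpy as np
from numba import cuda

@cuda.jit
def fill(arr):
    i = cuda.grid(1)
    # Bug: no bounds check, so surplus threads write past the end of the array
    arr[i] = i

data = cuda.to_device(np.zeros(100, dtype=np.float32))
fill[4, 64](data)  # 256 threads launched for only 100 elements
cuda.synchronize()

# Running this under the sanitizer flags the out-of-bounds global writes:
#   compute-sanitizer --tool memcheck python oob_example.py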

Nsight Compute (Memory Performance)

Focus: Analyzing memory access patterns and performance within CUDA kernels, not for error detection but for optimization.

Key Use Cases: Identifying memory throughput bottlenecks, high-latency memory operations, inefficient cache utilization, and understanding how kernels interact with different levels of GPU memory.

Core Benefit: Guides optimization of kernel memory access to improve performance.

Example Command(s):

# Collect the detailed memory workload analysis sections
ncu --section MemoryWorkloadAnalysis --section MemoryWorkloadAnalysis_Tables ./my_cuda_kernel_app

# Focus on L1/L2 cache metrics
ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_hit.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss.sum ./app
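
As an illustration of the access patterns these sections expose, the sketch below launches a coalesced copy and a strided copy. Numba is assumed here only as a convenient way to express the kernels from Python, and the script name is hypothetical:

import numpy as np
from numba import cuda

@cuda.jit
def copy_coalesced(src, dst):
    i = cuda.grid(1)
    if i < src.size:
        dst[i] = src[i]  # neighbouring threads touch neighbouring addresses

@cuda.jit
def copy_strided(src, dst, stride):
    i = cuda.grid(1)
    j = i * stride
    if j < src.size:
        dst[j] = src[j]  # neighbouring threads touch addresses far apart

n = 1 << 22
src = cuda.to_device(np.random.rand(n).astype(np.float32))
dst = cuda.device_array_like(src)
threads = 256
blocks = (n + threads - 1) // threads
copy_coalesced[blocks, threads](src, dst)
copy_strided[blocks, threads](src, dst, 32)
cuda.synchronize()

# Comparing the two kernels' memory charts in Nsight Compute highlights the wasted sectors:
#   ncu --section MemoryWorkloadAnalysis python access_pattern_example.py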

Note: Third-party tools like GpuMemTest (Windows), memtest_vulkan (cross-platform), or OCCT (Windows) also offer GPU memory stress testing capabilities, which can be useful for finding hardware instabilities.

Optimizing CPU-GPU Data Transfers

Efficient data transfer between the CPU (host) and GPU (device) memory is critical for the performance of many GPU-accelerated applications. Bottlenecks here can leave the GPU waiting for data. This section covers tools to analyze and optimize these transfers over the PCIe bus.

Nsight Systems (Memory Transfer Analysis)

Focus: Visualizing and timing memory copy operations (e.g., `cudaMemcpy`, OpenACC data transfers) on the system timeline.

Key Use Cases: Identifying large or frequent CPU-GPU transfers, assessing if transfers overlap with computation, pinpointing PCIe bandwidth saturation, and analyzing the impact of memory transfer latency.

Core Benefit: Provides a clear view of data movement between host and device, highlighting inefficiencies that can be addressed by techniques like using pinned memory, asynchronous transfers, or reducing data volume.

Example Command(s):

# Profile CUDA memory operations along with other system activity
nsys profile -t cuda,nvtx,osrt -o transfer_analysis_report ./my_app_with_transfers

# Focus on CUDA API calls, including memory transfers
nsys profile --trace=cuda ./my_app_with_transfers
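
To make the remediation techniques above concrete, here is a minimal sketch of pinned memory plus asynchronous transfer, assuming PyTorch; the same ideas apply to cudaHostAlloc/cudaMemcpyAsync in CUDA C/C++ or to pinned allocations in CuPy:

import torch

assert torch.cuda.is_available()

x_pageable = torch.randn(16, 1024, 1024)    # ordinary pageable host memory
x_pinned = x_pageable.clone().pin_memory()  # page-locked (pinned) host memory

stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    # With a pageable source the driver must stage the data, so the copy cannot truly overlap;
    # with a pinned source and non_blocking=True the copy is a genuine asynchronous DMA transfer.
    a = x_pageable.to("cuda", non_blocking=True)
    b = x_pinned.to("cuda", non_blocking=True)
    total = (a + b).sum()
torch.cuda.synchronize()
print(total.item())

# Profiling this script with the nsys commands above shows the difference in the
# Memcpy HtoD rows (duration and overlap) for the pageable vs. pinned source.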

nvbandwidth / bandwidthTest (PCIe Benchmark)

Focus: Benchmarking the peak achievable bandwidth of the PCIe bus for host-to-device (HtoD) and device-to-host (DtoH) transfers.

Key Use Cases: Verifying that the PCIe hardware configuration (e.g., link width, speed) allows for expected transfer rates. Identifying if the system is underperforming in terms of raw transfer capability.

Core Benefit: Establishes a baseline for maximum PCIe throughput, helping to distinguish between hardware limitations and application-level transfer inefficiencies.

Example Command(s) (using CUDA Sample `bandwidthTest`):

# (bandwidthTest ships with the CUDA Samples; older toolkits installed it under /usr/local/cuda/samples/1_Utilities/bandwidthTest, newer ones provide it via the cuda-samples GitHub repository)
# Test all memory types and transfer directions
./bandwidthTest

# Test host-to-device bandwidth using pinned memory
./bandwidthTest --memory=pinned --htod

# Test device-to-host bandwidth using pageable memory
./bandwidthTest --memory=pageable --dtoh

DCGM (PCIe Telemetry)

Focus: Monitoring PCIe bandwidth utilization (bytes transmitted/received over PCIe) and link error counts at a system/cluster level.

Key Use Cases: Spotting GPUs with consistently saturated PCIe links or links accumulating errors, which could impact CPU-GPU transfer performance.

Core Benefit: Provides ongoing monitoring of PCIe health and throughput, complementing Nsight Systems' application-specific trace.

Example Command(s):

# Monitor PCIe transmit and receive throughput for GPU 0
# Field IDs: 1009 (PCIe Tx Bytes), 1010 (PCIe Rx Bytes)
dcgmi dmon -i 0 -e 1009,1010

# Check the current PCIe link generation and width for each GPU (via nvidia-smi)
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv
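
For quick scripted checks without a DCGM host engine, the sketch below reads similar PCIe throughput counters via the pynvml bindings; this is an illustrative alternative, and NVML reports these counters in KB/s over a short sampling window:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

# Instantaneous PCIe throughput counters, reported by NVML in KB/s
tx_kb_per_s = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES)
rx_kb_per_s = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES)
print("PCIe TX:", tx_kb_per_s, "KB/s | RX:", rx_kb_per_s, "KB/s")

pynvml.nvmlShutdown()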

GPUDirect Technology (Conceptual)

Focus: A suite of technologies enabling data transfers between GPUs and other devices (NICs, storage) to bypass the CPU and system memory, reducing latency and CPU overhead.

Key Use Cases: GPUDirect RDMA for network transfers, GPUDirect Storage for direct access to NVMe drives. While not a "tool" itself, applications and libraries utilize these technologies.

Core Benefit: Significantly improves performance for I/O intensive workloads by creating more direct data paths to/from the GPU.

Nsight Systems can help observe the benefits of GPUDirect (e.g., lower CPU usage during transfers, faster I/O operations) when it's effectively employed by the application or its libraries.

Analyzing High-Speed Interconnects

The performance of high-speed interconnects like InfiniBand, high-speed Ethernet (RDMA), and NVIDIA NVLink is critical for applications that scale across multiple GPUs and nodes. This section delves into tools for profiling and optimizing these vital communication pathways.

InfiniBand & High-Speed Ethernet (RDMA)

Tools in this category help diagnose and optimize network communication for distributed applications using InfiniBand or Ethernet with RDMA capabilities.

Nsight Systems for Network Communication

Focus: Extends the system-wide view to network communications.

  • Traces MPI and UCX library calls, samples NIC metrics.
  • Visualizes network traffic patterns (message sizes, frequency) with CPU/GPU activity.
  • Identifies communication stalls, congestion, idle NIC/HCA activity.
  • Supports RoCE, InfiniBand; can sample switch metrics and port congestion events.
  • Integrates with `ibdiagnet` for topology information.

Vital for understanding how network behavior impacts overall application throughput.

Example Command (within Nsight Systems profiling):
# Enable MPI tracing and NIC/HCA metrics sampling when profiling with Nsight Systems
nsys profile -t mpi,nvtx,cuda --nic-metrics=true -o report_with_network ./my_mpi_app

InfiniBand-Specific Utilities (MLNX_OFED/UFM)

Focus: Comprehensive InfiniBand fabric management and diagnostics.

  • Includes `ibdiagnet` (fabric discovery/health), `ibqueryerrors` (port errors), `ibnetdiscover` (topology), `ibstatus` (HCA status).
  • Used for low-level fabric health checks, physical layer issues, topology/routing verification.
  • Assesses link congestion via counters like `XmtWait` (from `perfquery`).

Primary diagnostic tools for suspected network hardware or configuration issues.

Example Command(s):
# Discover InfiniBand topology
ibnetdiscover

# Check status of local InfiniBand HCAs
ibstatus

# Run comprehensive fabric diagnostics (can be disruptive, use with caution)
# sudo ibdiagnet -r

# Summarize port error counters across the fabric (by default, only counters exceeding thresholds are reported)
ibqueryerrors

Tool Strength Comparison

This chart offers a comparative overview of the primary strengths of key NVIDIA tools across different performance analysis domains. Higher values indicate stronger applicability or focus in that area. (Scale: 1-Low, 3-Medium, 5-High)